Abstract

Cyanobacteria forged two major evolutionary transitions with the invention of oxygenic photosynthesis and the bestowal of photo- 
synthetic lifestyle upon eukaryotes through endosymbiosis. Information germane to understanding those transitions is imprinted in 
cyanobacterial genomes, but deciphering it is complicated by lateral gene transfer (LGT). Here, we report genome sequences for the 
morphologically most complex true-branching cyanobacteria, and for Scytonema hofmanni PCC 7110, which with 12,356 proteins
is the most gene-rich prokaryote currently known. We investigated components of cyanobacterial evolution that have been vertically 
inherited, horizontally transferred, and donated to eukaryotes at plastid origin. The vertical component indicates a freshwater origin 
for water-splitting photosynthesis. Networks of the horizontal component reveal that 60% of cyanobacterial gene families have been 
affected by LGT. Plant nuclear genes acquired from cyanobacteria define a lower bound frequency of 611 multigene families that, in
turn, specify diazotrophic cyanobacterial lineages as having a gene collection most similar to that possessed by the plastid ancestor. 
Introduction

Cyanobacteria are crucial players in Earth and life history 
because they generated the oxygen that has been present 
in the Earth's atmosphere for the last 2.4 billion years
(Bekker et al. 2004) and because one uniquely fateful
cyanobacterium became, via endosymbiosis, the ancestor of 
all plastids among photosynthetic eukaryotes (Gould et al. 
2008). Though they continue to impact global geochemical 
cycles through N2 fixation (Moisander et al. 2010) and the
sequestering of trace metals (Morel and Price 2003) as well as
phosphorus (van Mooy et al. 2009), their main ecological
significance is the oxygen-producing photosynthetic apparatus
that fuels most contemporary food chains. Their main
evolutionary significance is that they mediated two pivotal 
innovations in life's history — water-splitting photosynthesis 
and the origin of primary plastids. Clues to both of those 
major evolutionary transitions should, in principle, be im- 
printed in cyanobacterial genomes. But reconstructing those 
events is not straightforward, because lateral gene transfer 
(LGT) redistributes genes among prokaryote genomes 
(Ochman et al. 2000), and among cyanobacterial genomes 
in particular (Raymond et al. 2002; Mulkidjanian et al. 2006; 
Dufresne et al. 2008; Shi and Falkowski 2008), over geological 
time. 

By necessity, and perhaps more so than for any other pro- 
karyotic group, LGT has always been hard-wired into the 
bigger picture of cyanobacterial evolution. To explain the 
origin of cyanobacterial water-splitting photosynthesis, both 
of the main competing theories require LGT to account for the 
distribution of photosystems across prokaryotic groups (Xiong 
and Bauer 2002; Hohmann-Marriott and Blankenship 2011).
This is because the reaction centers of photosystems I and II
clearly share common ancestry (Baymann et al. 2001;
Hohmann-Marriott and Blankenship 2011), but that shared ancestry
does not specify how they entered the genome of the cyanobacterial
ancestor. One theory posits that the two photosystems
evolved in independent lineages and became merged in the 
founder cyanobacterium via LGT (Baymann et al. 2001), while 
the alternative has it that the photosystems diverged within a 
photosynthetic (protocyanobacterial) ancestor and were sub- 
sequently exported via LGT to some anoxygenic photosyn- 
thetic lineages (Xiong and Bauer 2002; Mulkidjanian et al. 
2006; Sharon et al. 2009). Compatible with a role for LGT 
in photosystem evolution is the finding that the genes for both 
photosystems I and II are mobile in marine phage metagen- 
omes (Lindell et al. 2004; Sharon et al. 2009). 

LGT also figures into the origin of plastids, because many 
genes were transferred from endosymbiont to host. 
Chloroplasts were once free-living cyanobacteria and contain
approximately 2,000 proteins (Richly and Leister
2004), a number comparable with that of a cyanobacterium, yet
the genomes of modern plastids contain only 5-10% as 
many genes as those of their free-living cousins. This suggests 
that hundreds or thousands of the plastid ancestor's genes 
were either lost or relocated to the host nucleus during the 
course of plant evolution via endosymbiotic gene transfer 
(EGT) (Gould et al. 2008). Furthermore, the phylogenetic iden- 
tity of the plastid ancestor remains debated because of LGT. 
Different phylogenetic trees trace the plastid ancestor near the 
base of cyanobacterial diversification (Criscuolo and Gribaldo 
2011), near coccoid cyanobacteria within the
Synechococcus-Prochlorococcus (SynPro) clade (Reyes-Prieto et al. 2010),
near the nitrogen-fixing Cyanothece clade (Deschamps et al.
2008), or near filamentous, heterocyst-forming cyanobacterial
lineages (Deusch et al. 2008). The simplest explanation for 
such findings — in an evolutionary context that incorporates 
LGT — is that the plastid ancestor donated one (chimeric) gen- 
ome's worth of genes to the host, and that LGT has been 
reassorting the homologs of these genes among free-living 
cyanobacterial and other prokaryote genomes ever since 
(Deusch et al. 2008). Because of LGT over time, the question 
of which "lineage" of cyanobacteria gave rise to the plastid 
loses meaning (Doolittle and Bapteste 2007), because 
the genomes and nature of the "lineages" have changed 
since the time of plastid origin over 1.2 billion years ago 
(Deusch et al. 2008; Gross et al. 2008). However, comparison 
of plant genes acquired from the plastid ancestor with cyano- 
bacterial homologs can reveal which modern cyanobacteria 
harbor a collection of genes most similar to that of the plastid 
ancestor. 

So far, missing in genomic studies of cyanobacterial evolu- 
tion are sequences from the group designated as subsection V 
(Rippka et al. 1979). Subsection V cyanobacteria grow as fila- 
ments that differentiate heterocysts (specialized N 2 -fixing 
cells), they produce cyst-like resting cells (akinetes) as well as 
differentiated motile trichomes (hormogonia), and most ex- 
hibit true branching. The developmental and morphological 
variety of subsection V cyanobacteria places them among the 
most complex of prokaryotes, for which reason they were 
even long thought to be the direct ancestors of all eukaryotes,
but only in the days before the endosymbiotic origin of plastids
had been postulated (Mereschkowsky 1905) and eventually
gained compelling support (Doolittle 1980). To better under- 
stand the role of subsection V species in cyanobacterial evo- 
lution and their possible relationship to the plastid ancestor, 
we have sequenced five genomes sampling a broad spectrum 
of filamentous, true-branching architecture (fig. 1A and B),
and diverse geographical locations including rice fields in
India (Fischerella muscicola PCC 73103 and Chlorogloeopsis
fritschii PCC 6912), and hot springs in New Zealand (F. muscicola
PCC 7414), Wyoming, USA (F. thermalis PCC 7521), and
Spain (C. fritschii PCC 9212) (Rippka et al. 1979). In addition,
Scytonema hofmanni PCC 7110, a Nostocales representative
(subsection IV) isolated from a limestone cave (Crystal Cave,
Bermuda) (Rippka et al. 1979), whose filaments form false
branches (fig. 1C) and exhibit aerial growth, was included
for comparison. 

Materials and Methods 

Cyanobacterial Cultures and DNA Isolation 

Stock cultures were maintained at 37°C on slants (or plates) in 
BG11o medium (Rippka and Herdman 2002), supplemented 
with 5 mM NaHCO3 and solidified with 0.9% (w/v) washed
agar (Sigma, A 8678). For DNA isolation, cultures were grown
at 37°C in BG11 medium (Rippka and Herdman 2002), with
orbital shaking (100 rpm) in an Infors incubator, at a PPFD of
Fig. 1. Genomes of Stigonematales and Scytonema. (A) Fischerella muscicola PCC 7414, forming true lateral branches. (B) Chlorogloeopsis fritschii
PCC 6912, undergoing cell divisions in more than one plane but never producing lateral branches. Heterocysts and hormogonia, differentiated by members
of both genera, are marked by red and cyan arrows, respectively. (C) Scytonema hofmanni PCC 7110 showing false branching filaments (black arrow) and
heterocysts (red arrow). (D) Genomic features of the six newly sequenced genomes. Genomes have been deposited in NCBI under accessions PRJNA104961,
PRJNA104963, PRJNA104969, PRJNA104967, PRJNA104965, and PRJNA157363. Fully annotated versions are available at www.molevol.de/resources.
(E) Frequency distribution of protein coding genes in the new genomes, and comparison with other cyanobacterial genomes examined.



30 µmol quanta m⁻² s⁻¹. Cultures were harvested after
3-6 weeks of incubation, depending on density of the inocu- 
lum and the growth rates of the strains. DNA isolation from 
strains of Chlorogloeopsis was performed as described 
(Franche and Damerval 1988), with the addition of 1% 
Sarkosyl during lysozyme treatment to remove polysacchar- 
ides and a final RNA digestion step. Polysaccharide-free 
high molecular weight genomic DNA (gDNA) from strains 
of Fischerella was obtained by following a protocol for 
polysaccharide-rich plants (Sharma et al. 2002). 

Genome Sequencing and Annotation 

Prior to genome sequencing the identity of the gDNA was 
verified by sequencing of the 16S rDNA with primers 101F
(ACTGGCGGACGGGTGAGTAA) and 1047R (GACGACAGCCATGCAGCACC),
and comparison against cyanobacterial
sequences available in NCBI. Genome sequencing was
performed on the Genome Sequencer FLX using Titanium
chemistry (Roche Applied Science, Penzberg, Germany) yield- 
ing a 10- to 32-fold coverage. Genome scaffolding was 
achieved by 3 kbp paired-end standard runs. The sequencing 
libraries were prepared from 4 µg of gDNA for whole genome
shotgun sequencing and 5 µg of gDNA for paired-end
sequencing, according to the supplier's instructions. 
Additionally, a fosmid library was constructed with the
CopyControl Fosmid Library Production Kit (Biozym Scientific, Hessisch
Oldendorf, Germany). Terminal DNA sequences of cloned
genomic inserts were determined with an ABI 3730xl DNA
Analyzer (Life Technologies, Darmstadt, Germany).
Furthermore, Sanger reads were generated from fosmid
clones to cover the gaps between contigs for each of the 
five genomes. Sequence data were assembled with the GS 
De Novo Assembler Software (ver. 2.0.01.14, 2.3, and 2.5.3). 
For each genome, large (>500 bp) and small contigs
(<500 bp) were obtained, including numerous repetitive
elements and insertion segments. For finishing purposes, all 
DNA sequences were uploaded into the Consed program 
(Gordon et al. 1998). The final annotation including COGs 
(Tatusov et al. 2001) of the genome sequences was accom- 
plished with the GenDB software (Meyer et al. 2003). Gene 
prediction was performed by combining the results of
the software tools GLIMMER (Delcher et al. 1999), CRITICA
(Badger and Olson 1999), and GISMO (Krause et al. 2007). 

Phylogenetic Analysis of Cyanobacterial Genomes 

Fully sequenced cyanobacterial proteomes were downloaded 
from NCBI (version of March 2011). For the reconstruction of
cyanobacterial gene families, we conducted an all-against-all 
BLAST search (Ver. 2.2.17) (Altschul et al. 1997) using the 
protein sequences. Reciprocal best BLAST hits (rBBH) were
identified using thresholds of E value < 10⁻¹⁰ and
amino acid identity >30%. For the clustering analysis, the overall
protein sequence similarity between rBBH proteins, calculated 
as the percent of identical amino acids, was multiplied by the 
length ratio of the two proteins. Clusters of gene families were 
inferred from the rBBH similarity matrix using the MCL ver. 
1.008 clustering procedure (Enright et al. 2002), with the 
inflation parameter (I) set to 2.0. For the reconstruction of a 
consensus tree phylogeny, 324 gene families present as single 
copies in all cyanobacterial genomes analyzed were aligned 
with MAFFT (Katoh et al. 2002) ver. 6.717b. Phylogenetic 
trees were reconstructed using the Neighbor-Joining (NJ) 
approach (Saitou and Nei 1987). Protein sequence distances 
were calculated with PROTDIST (Felsenstein 1993), and 
applying the JTT substitution model (Jones et al. 1992). 
Phylogenetic trees were reconstructed with NEIGHBOR 
(Felsenstein 1993). The consensus phylogeny was recon- 
structed with CONSENSE (Felsenstein 1993). A concatenated 
alignment was reconstructed from the aligned protein se- 
quences, and all genes were weighted equally (supplementary 
fig. S1, Supplementary Material online). A phylogenetic tree 
was reconstructed from the concatenated alignment using 
the NJ approach and the software described as earlier. 
A phylogenetic network was reconstructed with SplitsTree 
Ver. 4.10 using the default parameters (Huson and Bryant 
2006). A minimal lateral network (MLN) was reconstructed 
using the consensus phylogeny as the reference tree, and 
the gene families described earlier according to the approach 
described in Dagan et al. (2008). Maximum likelihood phyl- 
ogeny was reconstructed using PhyML (Guindon et al. 2010) 
with the LG model + I (estimation of invariant sites) + G (gamma
distribution with 4 rate categories). Tree topology (SPR), 
branch length, and rate parameters were optimized. 
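
The similarity weighting used for clustering can be illustrated with a minimal sketch. The function names are hypothetical, and the direction of the length ratio (shorter protein divided by longer protein) is an assumption, since the text only states that percent identity was multiplied by the length ratio of the two proteins.

```python
def passes_rbbh_thresholds(e_value, percent_identity,
                           max_e_value=1e-10, min_identity=30.0):
    """Apply the E value and identity thresholds stated for rBBH pairs."""
    return e_value < max_e_value and percent_identity > min_identity


def weighted_similarity(percent_identity, len_a, len_b):
    """Percent identity scaled by the length ratio of the two proteins.

    Assumption: the ratio is shorter/longer, so it never exceeds 1 and
    pairs of very unequal length are down-weighted.
    """
    if len_a <= 0 or len_b <= 0:
        raise ValueError("protein lengths must be positive")
    length_ratio = min(len_a, len_b) / max(len_a, len_b)
    return percent_identity * length_ratio
```

Under this assumption, a pair with 80% identity whose lengths are 200 and 400 residues receives half the weight of an equally similar pair of equal length. The resulting values would populate the similarity matrix passed to MCL.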

Phylogenetic Analysis of the Plastid Ancestor 

Sequences of nuclear-encoded proteins from the whole 
genomes of Arabidopsis thaliana, Oryza sativa subsp. japonica,
Physcomitrella patens, Chlamydomonas reinhardtii, Entamoeba
histolytica, Dictyostelium discoideum, Filobasidiella
neoformans, Saccharomyces cerevisiae, Schizosaccharomyces 
pombe, Drosophila melanogaster, Ciona intestinalis, Danio 
rerio, Gallus gallus, Canis lupus familiaris, and Homo sapiens 
were obtained from RefSeq database release November 2009 
(Pruitt et al. 2007). Nuclear proteomes of Cyanidioschyzon 
merolae version February 2005 (Matsuzaki et al. 2004), 
Ostreococcus tauri version 2.0 (Palenik et al. 2007), and 
Xenopus tropicalis release 4.1, August 2005 (Bowes et al. 
2008), were downloaded from the respective genome project 
websites. Additionally, 650 fully sequenced genomes of pro- 
karyotes, including those of 46 cyanobacterial representatives, 
were downloaded from NCBI RefSeq database release 
November 2009 (Pruitt et al. 2007). To avoid clustering 
artifacts of distantly related eukaryotic and prokaryotic se- 
quences, the sequences of cyanobacteria and photosynthetic 
eukaryotes were first clustered into separate sets of protein 
families. Matrices of algal/plant and cyanobacterial sequences 
were constructed from reciprocal best BLAST hits using an 
all-against-all BLAST, and thresholds of E value < 10⁻¹⁰ and
amino acid sequence identities >25%. Clusters of homologous
protein sequences were reconstructed from each of the
matrices using MCL (Enright et al. 2002) ver. 08-312, 1.008,
with scheme = 7 and I = 2.0. Protein sequences of noncyanobacterial
prokaryotes and nonphotosynthetic eukaryotes were
added to the plant/algal clusters of proteins, depending on 
their sequence homologies using the above threshold, and a 
limit of three sequences per phylum. Overlapping plant/algal 
and cyanobacterial clusters were joined. The sequences of 
protein families were aligned using MAFFT (Katoh et al. 
2002) Ver. 6.717b (2009/12/03). Multiple sequence align- 
ment quality was assessed using the HoT-method (Landan 
and Graur 2007). Plant/algal protein sequences with Sum of 
Pairs Score <80% were excluded from the cluster. 
Phylogenetic trees were reconstructed using maximum likeli- 
hood approach with PhyML (Guindon et al. 2010) and the 
best-fit model as inferred with ProtTest (Abascal et al. 
2005). The search for a best-fit model using ProtTest was restricted
to nuclear gene substitution models including the JTT
(Jones et al. 1992) and WAG (Whelan and Goldman 2001)
matrices. These were tested with all combinations of +I (estimation
of invariant sites), +G (gamma distribution with 4 rate
categories), and +F (using amino acid frequencies from the
alignment) parameters. Branch lengths, model, and topology
were optimized. From among 35,862 trees in total, the WAG
model was found to be the best fit in 89% of the trees, with
WAG + I + G as the most prevalent choice (34%). Genes of
endosymbiotic origin in algal and plant genomes were inferred 
from the phylogenetic trees by searching for sisterhood be- 
tween cyanobacterial protein sequences and their counter- 
parts encoded by the nuclear genes of the photosynthetic 
eukaryotes (Martin et al. 2002). Protein families in the latter 
phototrophs were counted as having resulted from EGT(s), if 
at least one of them had a cyanobacterial sequence as the
nearest neighbor. Concatenated alignments were analyzed 
and used for tree construction by the same methods as 
described earlier. 
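
The sisterhood criterion for inferring EGT can be sketched as follows. This is an illustration only, not the pipeline used here: it assumes a rooted, strictly binary tree written as nested tuples, with leaf labels prefixed by a hypothetical group tag (`plant:`, `cyano:`, or `other:`), and it scores a tree as a putative EGT when some clade made up only of plant sequences has a sister clade made up only of cyanobacterial sequences.

```python
def leaves(node):
    """Return all leaf labels below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def group_of(label):
    """Extract the group tag, e.g. 'plant' from 'plant:At'."""
    return label.split(":", 1)[0]


def is_putative_egt(tree):
    """True if an all-plant clade has an all-cyanobacterial sister clade.

    Assumption: the tree is rooted and binary (every internal node is a
    2-tuple); unrooted or multifurcating trees need a different test.
    """
    if isinstance(tree, str):
        return False
    left, right = tree
    for clade, sister in ((left, right), (right, left)):
        if (all(group_of(x) == "plant" for x in leaves(clade))
                and all(group_of(x) == "cyano" for x in leaves(sister))):
            return True
    return is_putative_egt(left) or is_putative_egt(right)
```

A protein family would then be counted once if any of its trees satisfies the test, which mirrors the "at least one" rule stated above.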

Results 

Genomes of Subsection V (Stigonematales) and 
Scytonema 

The genome size distribution of the five Stigonematales strains 
(5.9 ± 2 Mb; fig. 1D) is similar to that of subsection IV members
(Nostocales) (Larsson et al. 2011). With only 5,340 CDSs,
F. thermalis PCC 7521 has the smallest genome among the
subsection V members, whereas the genome of S. hofmanni
PCC 7110 (subsection IV) has 12,356 predicted ORFs, making
it the most gene-rich prokaryote sequenced to date (fig. 1D). 
Clustering of all 223,941 CDSs encoded in 51 cyanobacterial 
genomes by protein sequence similarity resulted in 18,185 
cyanobacterial protein families and 47,174 singletons. 
Protein families with metabolic or cellular functions have 
significantly more duplicates in strains of subsection V than 
in those of subsection IV (P < 2.2 × 10⁻¹⁶, paired t test).
Subsection V and IV strains do not differ in gene copy
number for information processing protein families
(P = 0.11, paired t test). The genome of strain PCC 7521 contains
fewer duplicates (P < 2.2 × 10⁻¹⁶, paired t test) than
the other two representatives of Fischerella. The frequency 
of genes shared with other filamentous cyanobacteria 
and the distribution of gene function are similar (fig. 1E and 
supplementary fig. S2, Supplementary Material online) 
among the three phenotypically similar Fischerella strains 
(Rippka et al. 1979). 

Patterns of gene presence and absence might identify 
genes related to cyanobacterial morphological diversity 
(Stucken et al. 2010; Larsson et al. 2011). A subset of 
22 protein families is unique and common to all filamentous 
cyanobacteria in our sample (supplementary table S1, 
Supplementary Material online), only a few of which have
known function. Subsection V members share 7 ± 1% of
their proteome with those of subsection IV, and 73 protein
families are specific to heterocyst-forming strains (supplementary
table S2, Supplementary Material online). Most of the
remaining subsection IV- and V-specific genes fall into cell
wall, membrane, and envelope biogenesis COGs, such as gly- 
cosyltransferases, exopolysaccharide synthesis, and secretion. 
Some of the subsection V-specific protein families might be 
involved in the multiseriate filament phenotype and formation 
of true branches. On average, only 2% of the proteins 
encoded in subsection V genomes are specific to 
true-branching forms. Only 46 gene families are uniquely 
shared among subsection V genomes (supplementary table 
S3, Supplementary Material online). Although their functions
are yet unknown, their classifications entail mostly cell wall,
membrane, envelope biogenesis, and signal transduction
functions. The relative paucity of proteins comprising the 
core set of the true branching cyanobacteria suggests that 
this phenotype hinges upon very few expressed proteins, 
which may mainly affect regulation of cell division genes 
and/or localization of their products. 

Vertical and Lateral Components of Cyanobacterial 
Genome Evolution 

To reconstruct a cyanobacterial backbone phylogeny, we 
identified all 324 single-copy protein families common to 
all 51 cyanobacteria in our sample and reconstructed their 
phylogenetic trees. The consensus tree (fig. 2), rooted with 
Gloeobacter violaceus, indicates a single origin for the fila- 
mentous architecture, and the concatenated alignment 
(564,408 sites) yielded an identical topology with NJ (supple- 
mentary fig. S3, Supplementary Material online), where 
all branches are supported by 100% bootstrap replicates. 
Maximum likelihood reconstruction yielded a phylogeny in 
which filamentous cyanobacteria are polyphyletic (supple- 
mentary fig. S3, Supplementary Material online), the differ- 
ence to NJ being the position of Microcoleus chthonoplastes 
PCC 7420, a filamentous strain isolated from salt marshes 
(Rippka et al. 1979). Current whole-genome cyanobacterial 
phylogenies group Microcoleus with subsection I (Criscuolo 
and Gribaldo 2011), yielding paraphyly for filamentous 
forms. Although 55 of our 324 single copy gene trees support 
that position for Microcoleus, 111 recover filamentous monophyly,
discrepancies that might reflect the workings of LGT
(Raymond et al. 2002; Mulkidjanian et al. 2006; Shi and 
Falkowski 2008; Dufresne et al. 2008). To test the consistency 
of the backbone (consensus) phylogeny, we reconstructed a 
phylogenetic network using SplitsTree (Huson and Bryant 
2006). The resulting network reveals a paucity of conflicting 
splits in the data (supplementary fig. S4, Supplementary 
Material online). A total of 92 out of 212 splits are compatible
with the NJ tree topology and their sum of split weight 
amounts to 96% of the total network; and thus, the NJ tree 
explains most of the split variability in the data. 
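
What it means for a split to be compatible with a tree can be illustrated with a short sketch. Two splits A|B and C|D of the same taxon set are compatible when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty, and a split is compatible with a tree when it is compatible with every split of that tree. The function names and the toy taxa below are hypothetical; this is not the SplitsTree implementation.

```python
def splits_compatible(side_a, side_c, taxa):
    """Pairwise compatibility of two splits given by one side each."""
    a = set(side_a)
    b = set(taxa) - a
    c = set(side_c)
    d = set(taxa) - c
    # Compatible if any of the four pairwise intersections is empty.
    return any(not (x & y) for x in (a, b) for y in (c, d))


def compatible_with_tree(split_side, tree_split_sides, taxa):
    """A split fits a tree if it is compatible with every tree split."""
    return all(splits_compatible(split_side, s, taxa)
               for s in tree_split_sides)


# Toy example: a five-taxon tree defined by two nontrivial splits.
toy_taxa = {"A", "B", "C", "D", "E"}
toy_tree_splits = [{"A", "B"}, {"D", "E"}]
```

In the toy example, the split {A, B, C} | {D, E} fits the tree, whereas {B, C} | {A, D, E} conflicts with the tree split {A, B} | {C, D, E}.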

To estimate the degree and distribution of LGT in cyano- 
bacterial evolution, we reconstructed a MLN, which infers 
LGT frequencies by allowing increasing amounts of LGT per 
protein family across a given backbone phylogeny (here the 
consensus tree), and identifying for all gene families the LGT 
frequency at which the distributions of modern genome sizes 
and inferred ancestral genome sizes agree best (Dagan et al. 
2008). The MLN analysis conservatively assumes that all gene 
trees for all protein families are compatible (Dagan et al. 2008) 
and entails no gene tree comparisons. It revealed that 6,068 
(34%) of the cyanobacterial protein families require no LGT to 
account for their gene distributions, whereas 12,116 (66%)
protein families have undergone at least one LGT event. 
Because the method does not tally conflicting gene trees for 
Fig. 2. Vertical and lateral gene evolution in cyanobacterial genomes. NJ consensus (or backbone) tree, inferred from 324 single-copy protein families
common to all 51 cyanobacteria in our sample, and rooted with Gloeobacter violaceus PCC 7421. Branches indicating vertical gene evolution are indicated in
black. The MLN is indicated by edges that do not map onto the vertical component, with number of genes per edge indicated by a color gradient from cyan
(1 gene) to orange (736 genes). The phylogenetic position of the eukaryotic clade reconstructed using 23 core genes is marked by "a." The SynPro clade is
marked by an arrow.



homologous sequences, these are conservative lower bound 
estimates, in contrast to other recent studies (Raymond et al. 
2002; Mulkidjanian et al. 2006; Shi and Falkowski 2008; 
Dufresne et al. 2008). Our estimate agrees
with earlier quantification of LGT frequency among cyanobacteria
using an embedded quartets approach (Zhaxybayeva
et al. 2006).

The MLN is presented in figure 2, and shows vertical 
components of cyanobacterial evolution and a network of 
1,183 edges indicating laterally shared genes. Within the network,
358 edges (30%) represent a single laterally shared
gene, whereas most edges (55%) carry <3 genes. Only 
91 (7%) of edges carry >20 genes. Thus, bulk transfers of 
tens of genes or more are rare. The clade of marine 
Prochlorococcus and Synechococcus (SynPro) strains, which 
are recognized as being closely related environmental special- 
ists of reduced genome size (Rocap et al. 2003; Dufresne et al. 
2008), appears to have the lowest LGT frequency. The intertwined
phylogenies within this clade (Zhaxybayeva et al. 2009)
go undetected because the MLN is reconstructed from gene 
presence/absence data that are uninformative for the recon- 
struction of recombination events at the intra-species level 
(Dagan et al. 2008). The most highly connected nodes impli- 
cate the four contemporary strains Acaryochloris marina MBIC 
11017, Cyanothece PCC 7425, M. chthonoplastes PCC 7420,
and S. hofmanni PCC 7110 (fig. 2). Two of these strains,
A. marina, an atypical marine unicellular cyanobacterium pro- 
ducing chlorophyll d as the primary photosynthetic pigment 
(Swingley et al. 2008), and M. chthonoplastes, a marine mat 
former, have the largest genomes (8.36 and 8.65 Mb, respect- 
ively) known for members of subsections I and III, and show an 
expansion of protein families (Larsson et al. 2011). The MLN
pinpoints large genomes as harboring gene pools that are 
frequently transferred among cyanobacteria, and identifies 
subsection V strains as being more highly connected with 
strains of subsections IV and III (1.4 edges/node) than with
unicellular strains (0.3 edges/node), even when strains of the
SynPro clade are excluded (0.7 edges/node). This may suggest
the existence of an LGT barrier between unicellular (mostly
marine) and filamentous (mostly terrestrial) cyanobacteria. 
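
The edge statistics reported for the network are simple tallies over edge weights, where the weight of a lateral edge is the number of genes it carries. The sketch below shows one way such a summary could be computed; the function name, the threshold parameter, and the toy weights are hypothetical and are not taken from the MLN itself.

```python
def summarize_edge_weights(weights, bulk_threshold=20):
    """Share of single-gene edges and of bulk edges in a lateral network.

    weights: number of laterally shared genes on each edge.
    bulk_threshold: edges carrying more genes than this count as bulk.
    """
    n_edges = len(weights)
    if n_edges == 0:
        raise ValueError("at least one edge is required")
    single = sum(1 for w in weights if w == 1)
    bulk = sum(1 for w in weights if w > bulk_threshold)
    return {
        "edges": n_edges,
        "single_gene_share": single / n_edges,
        "bulk_share": bulk / n_edges,
    }


# Toy weights for illustration only.
toy_weights = [1, 1, 2, 3, 25, 1, 40, 2, 1, 5]
toy_summary = summarize_edge_weights(toy_weights)
```

A network dominated by low-weight edges, as described above, would show a high single-gene share and a small bulk share.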

The Nature of the Plastid Ancestor 

To identify plant nuclear genes of cyanobacterial origin, 
we reconstructed 35,862 phylogenetic trees containing both 
eukaryotic and prokaryotic homologs and looked for trees 
in which plants and cyanobacteria branch together. In the 
present sample, considering all trees, between 8.7% and 
11.5% of all nuclear genes in photosynthetic eukaryotes 
sampled branch with cyanobacterial homologs (table 1). 
Table 1

Proportion of Plant Genes of Endosymbiotic Origin

Species | No. Proteins | Total Tree Set: No. Trees | Total Tree Set: No. Putative EGT | Total Tree Set: EGT Bootstrap Support | CS>80%: No. Trees | CS>80%: No. Putative EGT | <3 homologues: No. Protein Families | <3 homologues: No. Putative EGT
Arabidopsis thaliana | 30,897 | 9,025 | 801 (8.9%) | 87.89 ± 20.10 | 3,306 | 424 (12.8%) | 2,091 | 136 (6.5%)
Oryza sativa | 26,712 | 7,292 | 637 (8.7%) | 84.82 ± 21.41 | 2,596 | 347 (13.4%) | 1,623 | 95 (5.9%)
Physcomitrella patens | 35,468 | 8,847 | 903 (10.2%) | 84.74 ± 22.11 | 3,425 | 542 (15.8%) | 1,402 | 78 (5.6%)
Ostreococcus tauri | 7,715 | 3,495 | 403 (11.5%) | 84.64 ± 21.20 | 1,232 | 247 (20.1%) | 324 | 26 (8.0%)
Cyanidioschyzon merolae | 4,761 | 2,688 | 307 (11.4%) | 83.92 ± 20.05 | 844 | 167 (19.8%) | 223 | 15 (6.7%)
Chlamydomonas reinhardtii | 14,262 | 4,515 | 478 (10.6%) | 83.81 ± 21.04 | 1,646 | 283 (17.2%) | 599 | 41 (6.8%)
Total | 119,815 | 35,862 | 3,529 (9.8%) | 84.97 ± 20.98 | 13,049 | 2,010 (15.4%) | 6,262 | 391 (6.2%)

Fig. 3. Phylogenetic characteristics of EGT inference. The frequency
of EGT as inferred from alignments of varying degrees of reliability. The
distribution of alignment reliability as estimated by column score (CS) is
presented in bars, colored according to the respective eukaryotes
(Arabidopsis thaliana, Physcomitrella patens, Oryza sativa, Chlamydomonas
reinhardtii, Ostreococcus tauri, and Cyanidioschyzon merolae). The CS
measure is calculated as the proportion of alignment sites whose reconstruction
is independent of the direction in which the sequences are
fed to the alignment algorithm (Landan and Graur 2007). The frequency
of genes inferred as EGT is plotted above in the eukaryote-dependent
strongly colored lines, with the proportions inferred from trees reconstructed
by maximum likelihood and NJ approaches in solid and dashed
lines, respectively.
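
The column score described in the legend can be sketched in a few lines. The sketch assumes two alignments of the same sequences in the same order: a "heads" alignment of the original sequences, and a "tails" alignment obtained from the reversed sequences and then flipped back to the original orientation. A column is identified by the residue indices it places together, and CS is the proportion of heads columns that reappear in the tails alignment. The function names are hypothetical, and this is an illustration rather than the HoT implementation of Landan and Graur (2007).

```python
def alignment_columns(alignment):
    """Describe each column by the residue index it holds per sequence.

    Gaps ('-') are recorded as None, so a column is the tuple of residue
    positions that the alignment claims to be homologous.
    """
    counters = [0] * len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        column = []
        for i, row in enumerate(alignment):
            if row[j] == "-":
                column.append(None)
            else:
                column.append(counters[i])
                counters[i] += 1
        columns.append(tuple(column))
    return columns


def column_score(heads, tails):
    """Proportion of heads columns reproduced in the tails alignment."""
    tails_columns = set(alignment_columns(tails))
    heads_columns = alignment_columns(heads)
    shared = sum(1 for col in heads_columns if col in tails_columns)
    return shared / len(heads_columns)
```

Identical alignments score 1.0, and alignments that place a gap differently score lower, because the columns around the gap pair different residues.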



For the most reliable alignments, where false negatives are less 
likely, the proportion of genes acquired from plastids ranges 
between 16% of the genes in the Arabidopsis genome and
>20% of the genes in the smaller genomes of
Ostreococcus and Cyanidioschyzon (fig. 3), with energy metabolism
and carbohydrate metabolism (99 genes) being the
most frequent functional categories (supplementary fig. S5,
Supplementary Material online). Clearly, the quantitative 
contribution of cyanobacteria to plant genomes was great, 
and the backbone of plant metabolism was acquired from 
them — plants are, biochemically, cyanobacteria wrapped in 
a bigger box. 

To trace the nature of the plastid ancestor, we first 
assembled a dataset of 23 nuclear genes of plastid origin pre- 
sent in all plant and cyanobacterial genomes sampled. The 
tree of concatenated alignments, rooted by G. violaceus 
PCC 7421, shows a deep branch placing plastids basal 
among cyanobacteria (designated with an "a" in fig. 2). 
Expanding the data set to include 200 universal cyanobacterial 
gene families with a single, composite plant OTU (genes 
acquired from cyanobacteria and present in at least one 
plant) yielded the same long, deep branch. Long basal 
branches are characteristic of long-branch attraction (LBA), a 
well-known phylogenetic artifact. Compositional heterogen- 
eity such as AT bias and heterotachy can cause LBA (Lockhart 
et al. 2006), and a basal position due to an LBA often involves 
the grouping of strains in which strain-specific character states 
are abundant (Stiller and Hall 1999). The sequences of the 
23 universally distributed proteins in the six photosynthetic 
eukaryotes were found to contain significantly more unique 
substitutions than their cyanobacterial homologues 
(P = 7 × 10⁻⁶⁶, one-tailed Kolmogorov-Smirnov test, supplementary
fig. S6, Supplementary Material online), and an
examination of the larger set of 200 phylogenetic trees recon- 
structed for genes of endosymbiotic origin shows that the 
eukaryotic clade branch length is on average 10-fold larger 
than that of the cyanobacterial branches. The basal position of 
plastids among cyanobacteria in the concatenated alignment 
tree (fig. 2 and supplementary fig. S6, Supplementary Material 
online) is attributable to LBA. Worse, given that LGT is fre- 
quent among cyanobacteria (Raymond et al. 2002; 
Mulkidjanian et al. 2006; Shi and Falkowski 2008; Dufresne 
et al. 2008), there is no reason to suspect that any "core" 
gene phylogeny will be a faithful proxy for the rest of the 
genome (Doolittle and Bapteste 2007). 
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ShofPCC7110 
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CyaPCC8801 
MaerNIES843 
CyanPCC7424 
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Fig. 4. — Presence/absence and sequence similarity patterns of cyanobacterial protein families by comparison with their homologs of endosymbiotic 
origin in six photosynthetic eukaryotes. Amino acid sequence similarity between the cyanobacterial proteins (xaxis) and their counterparts in the eukaryotic 
plastid-derived set of protein families (y axis), as deduced for the genomes in the data set. Cell shades in the matrix correspond to the similarity ranking for 
each protein family (i.e., line) according to a color gradient from red (high similarity) to blue (low similarity). White cells correspond to genes lacking in the 
respective genomes. Protein families are ordered according to their distribution pattern into (A) nearly universal, (B) sparse representation or (C) highly
frequent in the oceanic species, and (D) generally sparse representation. Cyanobacterial strains are ordered according to the MLN in fig. 2. 



Therefore, we turned our attention to the larger set of nu- 
clear genes of cyanobacterial origin whose homologs are not 
universally distributed among cyanobacteria. For 611 plant 
nuclear gene families identified as plastid acquisitions, we 
scored gene presence and absence, and protein sequence 
identity among cyanobacterial genomes (fig. 4). The SynPro 
clade lacks a substantial portion of these plastid ancestor gene 
families. A total of 245 (40%) protein families possessed by 
plants are absent in all Prochlorococcus strains, and 137 (22%) are
absent in all Synechococcus strains (fig. 4). The similarity map 
also shows that overall protein sequence similarity of plant 
nuclear genes is highest to homologs in members of subsections
IV and V. For 225 (37%) protein families, the average
amino acid identity between the cyanobacterial genes and 
their plant homologs is significantly higher for subsection V
genomes (α = 0.05, Kolmogorov-Smirnov test and FDR) than
for subsection I genomes. When subsection IV and V genomes 
are combined and compared with those of subsection I, the 
value increases to 270 (44%) (α = 0.05, Kolmogorov-Smirnov
test and FDR). Thus, subsection IV and V genomes harbor 
more homologs of genes that plants acquired from cyanobac- 
teria and those have higher sequence similarity to their plant 
homologs than genomes of subsection I. Similar amino acid 
usage in different organisms may sometimes lead to an overestimation
of species relatedness (Rodríguez-Ezpeleta and
Embley 2012). Here, we tested for such possible bias using
a principal component analysis (PCA) for the amino acid frequencies
encoded by the 611 genes of endosymbiotic origin.
The transformation of amino acid usage into two principal 
components explains in total 89% of the variability observed 
(supplementary fig. S7, Supplementary Material online).
Furthermore, the PCA reveals that the eukaryotic species do 
not group with the filamentous cyanobacteria; hence, the 
protein sequence similarity observed between those two 
groups is not a result of biased amino acid usage. 
Consequently, we can conclude that in the present sample, 
the collection of genes possessed by the ancestor of plastids 
was most similar to that in filamentous, heterocyst-forming 
cyanobacteria (fig. 2). 
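
The per-family comparison described above (a Kolmogorov-Smirnov test on amino acid identities, followed by FDR control at α = 0.05) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the one-sided form of the test, the large-sample p-value approximation, the choice of the Benjamini-Hochberg procedure for "FDR", the function names, and the identity values are all assumed here.

```python
import math
from bisect import bisect_right


def ks_greater(x, y):
    """One-sided two-sample KS test; alternative: values in x tend to exceed those in y.

    Returns (D, p). D is the largest amount by which the empirical CDF of y
    lies above that of x. p uses the large-sample approximation
    exp(-2 * n*m/(n+m) * D**2), which is only rough for small samples.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = max(bisect_right(ys, v) / m - bisect_right(xs, v) / n for v in xs + ys)
    p = math.exp(-2.0 * n * m / (n + m) * d * d)
    return d, p


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical identities to the plant homolog: (subsection V, subsection I).
families = {
    "fam_A": ([0.71, 0.69, 0.74, 0.72], [0.55, 0.58, 0.52, 0.57, 0.54]),
    "fam_B": ([0.61, 0.60, 0.63, 0.59], [0.60, 0.62, 0.58, 0.61, 0.63]),
}
names = list(families)
pvals = [ks_greater(sub_v, sub_i)[1] for sub_v, sub_i in families.values()]
qvals = benjamini_hochberg(pvals)
significant = [name for name, q in zip(names, qvals) if q <= 0.05]
print(significant)
```

Counting the families that remain significant after correction gives the kind of tally reported in the text (e.g., 225 of 611 families); with real data an exact small-sample KS p-value would be preferable to the approximation used here.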

Discussion 

Possible Initial Benefits of Plastids 

Today plastids supply fixed carbon to plant cells, but they also 
have a myriad of other functions in amino acid, lipid, and 
cofactor biosynthesis as well as nitrogen metabolism. What 
was the biochemical or physiological context of the symbiosis 
that gave rise to plastids — what initially associated the founder 
endosymbiont to its host in the first place? Traditional reason- 
ing on the selective advantage that was crucial to the estab- 
lishment of the plastid has it that the production of 
carbohydrates by the cyanobacterial endosymbiont was the 
key, a view that was clearly expressed by Mereschkowsky 
(1905, p. 605) in his initial formulation of endosymbiotic 
theory: "Plant cells receive with no effort whatsoever large 
amounts of preformed organic substrates (carbohydrates), 
which their chromatophores willingly supply." 

An alternative suggestion is that the initial advantage of 
plastids may have simply been their uniquely useful metabolic 
end product, O2, as a boost to respiration in early mitochondria
(Martin and Müller 1998). The chemical benefit of O2
could, of course, have only been of value if the initial endosymbiosis
had taken place at a time in Earth's history, or in an
environment, where O2 was not freely available in sufficient
amounts. Fossil evidence supports the notion that the primary
plastid endosymbiosis occurred at least 1.2 billion years ago
(Butterfield 2000) and molecular estimates suggest that plastids
might have arisen by approximately 1.5 billion years ago
(Parfrey et al. 2011). Geochemists have found over the last 
decade that an approximately 2 billion year span of protracted 
ocean anoxia ended only about 580 Ma (Anbar and Knoll 
2002; Johnston et al. 2009; Lyons et al. 2009; Lyons and 
Reinhardt 2009; Sahoo et al. 2012). The six major eukaryotic 
assemblages or "supergroups" currently recognized, including
plants, arose and diversified during that time (Parfrey et al.
2011), that is, while the oceans were still anoxic (Müller et al.
2012). Such geological context (ocean anoxia during most of
the Proterozoic) would be compatible with a possible role for
O2 as an initial benefit in plastid evolution. Indeed, for
Stanier (1970), the production of O2 was a reason to suggest
that plastids arose before mitochondria did. Of course, 
Proterozoic ocean anoxia was likely less pronounced in the 
photic zone than below it (Johnston et al. 2009). A freshwater
origin of plastids is also a possibility to consider: the
present data, which link plastids phylogenomically more closely
with freshwater cyanobacteria than with marine forms
(fig. 2), would be compatible with that view.

Another suggestion is that the key to establishment 
of the plastid was the origin of carbon translocators in 
the plastid inner membrane and that the incorporation of 
a metabolite antiporter like the triose phosphate translocator 
in the ancestral plastid membrane was the essential step 
for establishing the primary endosymbiosis by allowing the 
plant ancestor to profit from cyanobacterial carbon fixation 
(Weber et al. 2006). In the same vein, it was furthermore 
argued that the key to establishment of the plastid entailed 
the insertion of additional host-controlled metabolite 
exchange proteins into plastid membranes fulfilling a similar 
export role (Gross and Bhattacharya 2009). A problem 
with theories that focus on carbon exporters as the key 
innovation at plastid origin is that cyanobacteria are well 
known to produce copious amounts of exopolysaccharides 
(De Philippis and Vincenzini 1998), such that there would be 
no need to evolve or insert transporters for provision of 
carbohydrates to be realized by the host. 

The theory for the initial benefit of plastids that is currently 
best founded in direct observation, we would argue, is 
that nitrogen fixation was a key to the establishment of 
the symbiosis (Kneip et al. 2007). This view is supported by 
the circumstance that in modern symbioses involving 
cyanobacteria, nitrogen (not reduced carbon) is usually the 
key nutrient underlying the success of the partnership 
(Rai et al. 2000; Raven 2002). Accordingly, the cyanobacterial 
endosymbionts are nitrogen-fixing forms, and combined nitrogen
(ammonium) is the nutrient provided by the cyanobacterium.
This is true for diatoms with N2-fixing cyanobacterial
endosymbionts (Prechtl et al. 2004; Kneip et al. 2008), 
prymnesiophytes with associated N-fixing cyanobacteria that 
might be ectosymbionts (Thompson et al. 2012), cyanobionts 
in lichens (Rikkinen et al. 2002), coralloid roots of cycads 
(Costa et al. 2004), the angiosperm Gunnera (Chiu et al. 
2005), and the water-fern Azolla (Ran et al. 2010). In the 
case of Azolla and Rhopalodia, the N2-fixing cyanobacteria
live as intracellular endosymbionts (Kneip et al. 2008; Ran
et al. 2010).

Recent studies have suggested that a filamentous pheno- 
type and heterocyst differentiation may have been hallmark 
phenotypic characteristics of the plastid ancestor (Deusch 
et al. 2008; Ran et al. 2010; Larsson et al. 2011). Indeed, in
modern cyanobacterial symbioses, fixed nitrogen is the main 
currency of benefit that the cyanobacterial symbiont provides 
to its host (Kayley et al. 2007). The early physiological associ- 
ation of the plastid ancestors with their host might thus have 
been similar to that of the unicellular nitrogen-fixing endosym- 
biont and its diatom host Rhopalodia (Kneip et al. 2008), or 
the highly reduced Nostoc azollae, an obligate cyanobiont of 
water-ferns, whose genome has drastically been reduced, 
with a large portion of the remaining genes specifically dedicated
to heterocyst differentiation and nitrogen fixation (Ran
et al. 2010). A potential problem with this view is that nitrogen
fixation has not been retained by any modern plant (Allen and 
Raven 1996). Why not? One possible reason concerns the 
circumstance that cyanobacterial O2 production led to an oxidation
state of the environment in which nitrate became
abundant (Falkowski et al. 2008); in a world of abundant
nitrate, nitrogenase is less necessary, hence less likely to be
retained, although one should recall that modern cyanobac- 
terial symbionts do fix nitrogen for their hosts. Perhaps, 
more importantly, in oxic environments cyanobacteria that 
express nitrogenase must exhibit either temporal separation 
of photosynthesis and nitrogen fixation (N2-fixation occurring
mainly in the dark; Mitsui et al. 1986), or other means of 
protecting the notoriously O2-sensitive enzyme from inactivation,
such as diazocyte differentiation in Trichodesmium
(Sandh et al. 2012), or heterocyst formation in subsections 
IV and V (Kumar et al. 2010). It is possible that such 
nitrogenase-protecting strategies, although readily accessible
to genetically autonomous prokaryotes, are not within the
realm of possibilities that plastids, which relinquished most of
their genetic autonomy, can developmentally attain. 

Many Endosymbionts, or Only One with Many Genes? 

Gene transfer following plastid origin readily explains plant 
nuclear genes that branch with cyanobacteria. However, 
many plant-specific genes branch with other prokaryotes 
(fig. 5A). Plant genes that branch with chlamydial homologs
have led to the inference that a chlamydial endosymbiont
accompanied the origin of plastids (Brinkman et al. 2002; 
Huang and Gogarten 2007; Price et al. 2012). This theory 
postulates that the plant ancestor consumed cyanobacteria 
as food and was parasitized by environmental chlamydias 
(Huang and Gogarten 2007; Moustafa et al. 2008), whereby 
the chlamydias were key to establishing the plastid because 
chlamydia-like bacteria donated genes that allowed export of 
photosynthate from the cyanobacterial plastid ancestor and its 
polymerization into storage polysaccharide in the cytosol (Price 
et al. 2012). The flaw with this theory is that it is based on the
uncritical interpretation of computational results of genome 
comparisons that, as has long been known (Rujan and Martin 
2001; Martin et al. 2002; Esser et al. 2007; Dagan et al. 2008),
would implicate many other groups of prokaryotes far more 
strongly than they would implicate chlamydias as active by- 
standers at the origin of plastids. The focus on chlamydia as 
opposed to, say spirochaetes or proteobacteria, is arbitrary 
and to some extent ad hoc. If one were to take the chlamydia 
theory seriously, or think it through in full, the transiently sym- 
biotic and gene-dealing "chlamydioplast" would have to take 
a number and wait in line next to the actinobacterioplast, 
the clostridioplast, the bacilloplast, the bacteriodetoplast, 
and the spirochaetoplast, and so forth (fig. 5A). Beyond the
Fig. 5. Taxon distribution of nearest neighbors to plant genes.
(A) Tree samples are distributed as follows: Arabidopsis: 2,324; Oryza:
1,792; Physcomitrella: 2,511; Ostreococcus: 968; Chlamydomonas: 
1,218; and Cyanidioschyzon: 693. Microbial taxonomic groups having 
a low frequency of nearest neighbors were grouped into the "Others" 
bar. Those include Aquificae, Dictyoglomi, Elusimicrobia, Fibrobacteres, 
Fusobacteria, Gemmatimonadetes, Korarchaeota, Nanoarchaeota, 
Nitrospirae, Tenericutes, Thaumarchaeota, and Thermotogae. (B) A com- 
parison of alignment quality (CS) between trees of Arabidopsis genes 
having a cyanobacterial nearest neighbor (black) and trees where a nearest 
neighbor from a different prokaryotic group was inferred (colored accord- 
ing to the taxa). In all groups but the Euryarchaeota, the alignment quality 
of trees where a noncyanobacterial nearest neighbor was inferred is 
significantly lower in comparison with tree topologies having cyanobacteria
as their nearest neighbor (using Wilcoxon test, α = 0.05). These results
suggest that the inference of noncyanobacterial nearest neighbors to
plant genes is less reliable than the inference of cyanobacterial nearest 
neighbors. 
cyanobacterial signal, which corresponds to a tangible double
membrane-bounded and DNA-containing organelle, the 
other putative phylogenetic signals in the data, especially 
that involving chlamydia, are better explained in terms of 
known phenomena, such as LGT among free-living prokary- 
otes (Dagan et al. 2008) and by phylogeny reconstruction 
errors (White et al. 2007; Stiller 2011) (fig. 5B), both of 
which we know to really exist, than in terms of gene dealing 
endosymbionts whose existence is inferred from a few gene 
trees. The null hypothesis for endosymbiotic theory in the age 
of genomes should be: The ancestors of plastids underwent 
LGT, just like modern cyanobacteria, whose genomes are 
chimeras of genes from many sources (Mulkidjanian et al. 
2006), and the plastid ancestor genome was probably no different
(Richards and Archibald 2011). LGT among prokaryotes
accounts for the diverse sequence affinities of genes acquired 
from the single ancestor of plastids with far fewer corollaries 
than a one-symbiont-per-gene theory. We merely need 
to incorporate the effect that LGT among prokaryotes will 
have over geological time on the endosymbiotic origins of 
organelles. 

Clues to the Origin of Two Photosystems 

One notable aspect of cyanobacterial phylogenomics pre- 
sented in this study is that the marine cyanobacteria are not 
basal in the trees (fig. 2 and supplementary fig. S3, 
Supplementary Material online). These small unicellular cyanobacteria
(diameter 1 µm or less) share reduced genome sizes
(<3 Mb) as a common trait, and seem to have arisen from
ancestors with larger genomes (Larsson et al. 2011) that, 
inferred from the phylogeny, lived in terrestrial, brackish, or 
perhaps freshwater environments (Sanchez-Baracaldo et al. 
2005). This led Blank and Sanchez-Baracaldo (2010) to sug- 
gest that oxygenic photosynthesis arose in a freshwater envir- 
onment. Our results support that view, and this conclusion has 
implications for the origin of water-splitting photosynthesis. 
Among many possibilities (Xiong and Bauer 2002; Hohmann- 
Marriot and Blankenship 2011; Williamson et al. 2011), it 
has been suggested that the progenitor of the cyanobacteria 
had genes for both type I (RCI) and type II (RCII) photosynthetic 
reaction centers (via gene duplication) but expressed either 
set of genes depending on the reducing conditions in 
the environment (Allen 2005): type RCI in the presence 
of H2S for noncyclic electron flow, as in Chlorobium (or the
facultative anaerobic cyanobacterium Oscillatoria limnetica);
and type RCII in the absence of H2S, for cyclic electron
flow, as in Rhodobacter (Allen 2005). Were regulation to fail
such that both type I and type II reaction centers became
expressed in the absence of H2S, the protocyanobacterium
would oxidatively perish, unless it could extract electrons 
from an environmentally available donor. 

Such an electron donor could have been aqueous Mn(II/III),
which has the useful property of being photo-oxidized by
ultraviolet light (Allen and Martin 2007), an abundant component
of solar radiation incident on the Earth's surface
prior to accumulation of atmospheric oxygen. Attaining suitably
high concentrations of Mn(II/III) as an environmentally
available electron donor in the ocean would be problematic, 
but not in a freshwater setting. Allen et al. (2012) have 
recently shown that an engineered, Mn-binding type II reaction
center of Rhodobacter sphaeroides will produce O2
from superoxide (O2−) in the presence of Mn in a light-dependent reaction
in which photo-damage is impeded in comparison
with that in a wild-type, Mn-free reaction center. Their 
observation (Allen et al. 2012) is likely an important clue 
to the origin of oxygenic photosynthesis, at which time 
a protocyanobacterial type II reaction center acquired, via 
natural selection, the ability to (photo-)oxidize Mn(II/III), itself
ultimately rereduced by water, and then to reduce a newly
constitutive type I reaction center. Transition from environmental
(substrate) Mn(II/III) ions to the catalytic Mn4Ca center
of cyanobacterial RCII would then have permitted
light-dependent CO2 and/or nitrogen fixation, in the
absence of electron donors other than water. 

What Makes a Branching Cyanobacterium? 

The morphological diversity of cyanobacteria poses an 
intriguing question in the biology and evolution of cell differentiation.
Transposon mutagenesis of Synechococcus elongatus
PCC 7942 (subsection I) revealed that the loss of several
genes involved in cell division leads to filament formation 
(Miyagishima et al. 2005). However, our analysis revealed 
that all recognized cyanobacterial cell division genes are pre- 
sent in the genomes of filamentous cyanobacteria, including 
those of subsection V. This suggests that the filamentous 
phenotype in cyanobacteria of subsections III, IV, and V is 
not due to loss of genes for cell division, though it is currently 
unknown whether those that are present are all expressed. 
Genes common to both unicellular and filamentous cyanobacteria
may also be important for determining trichome structure
in members of subsections III-V. This is suggested by
a recent study on the filamentous heterocystous strain 
N. punctiforme ATCC 29133 (Lehner et al. 2011), which 
showed that mutations of the amiC2 gene, encoding an ami- 
dase involved in septa formation, will lead to a morphology 
similar to that of colonial unicellular cyanobacteria, and 
prevent heterocyst differentiation. Furthermore, filament 
formation in S. elongatus PCC 7942 can be induced by
over-expression of the gene encoding FtsZ, which is known 
as a cell division protein (Mori and Johnson 2001). Thus, 
the lack of clear candidate genes whose distribution across
cyanobacterial genomes correlates with cellular morphology,
together with the experimental evidence linking the expression
level (rather than presence/absence) of cell division
proteins to filament formation, suggests that a filamentous
phenotype may result from modifications of the gene regulatory
network and cell division program.

Supplementary Material 

Supplementary figures S1-S7 and tables S1-S3 are available 
at Genome Biology and Evolution online
(http://www.gbe.oxfordjournals.org/).